We’re going to load in some basic packages. We use
clusterProfiler to do the initial gene set analysis and the
test data we’re using is human so we’re loading the
org.Hs.eg.db annotation package - you’d modify that to use
the package for the species you’re using for your data.
We’re using tidyverse for general data loading and manipulation - this isn’t required but it’s nice.
library(geneSetExtras)
library(org.Hs.eg.db)
#> Loading required package: AnnotationDbi
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, aperm, append, as.data.frame, basename, cbind,
#> colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
#> get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
#> match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
#> tapply, union, unique, unsplit, which.max, which.min
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: IRanges
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> expand.grid, I, unname
#>
library(clusterProfiler)
#>
#> clusterProfiler v4.18.4 Learn more at https://yulab-smu.top/contribution-knowledge-mining/
#>
#> Please cite:
#>
#> G Yu. Thirteen years of clusterProfiler. The Innovation. 2024,
#> 5(6):100722
#>
#> Attaching package: 'clusterProfiler'
#> The following object is masked from 'package:AnnotationDbi':
#>
#> select
#> The following object is masked from 'package:IRanges':
#>
#> slice
#> The following object is masked from 'package:S4Vectors':
#>
#> rename
#> The following object is masked from 'package:stats':
#>
#> filter
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 4.0.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
#> ✔ purrr 1.1.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ lubridate::%within%() masks IRanges::%within%()
#> ✖ dplyr::collapse() masks IRanges::collapse()
#> ✖ dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
#> ✖ dplyr::desc() masks IRanges::desc()
#> ✖ tidyr::expand() masks S4Vectors::expand()
#> ✖ dplyr::filter() masks clusterProfiler::filter(), stats::filter()
#> ✖ dplyr::first() masks S4Vectors::first()
#> ✖ dplyr::lag() masks stats::lag()
#> ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
#> ✖ purrr::reduce() masks IRanges::reduce()
#> ✖ dplyr::rename() masks clusterProfiler::rename(), S4Vectors::rename()
#> ✖ lubridate::second() masks S4Vectors::second()
#> ✖ lubridate::second<-() masks S4Vectors::second<-()
#> ✖ dplyr::select() masks clusterProfiler::select(), AnnotationDbi::select()
#> ✖ purrr::simplify() masks clusterProfiler::simplify()
#> ✖ dplyr::slice() masks clusterProfiler::slice(), IRanges::slice()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errorsThe format of the data here isn’t important, we just need some way to define the background and hit lists for each group.
read_delim(system.file("hit_clusters.txt", package="geneSetExtras")) -> hit_data
hit_data |>
group_by(cluster) |>
count()
#> # A tibble: 8 × 2
#> # Groups: cluster [8]
#> cluster n
#> <chr> <int>
#> 1 cluster_11 383
#> 2 cluster_13 626
#> 3 cluster_15 271
#> 4 cluster_19 349
#> 5 cluster_4 822
#> 6 cluster_5 144
#> 7 cluster_8 677
#> 8 <NA> 20975We need to generate an initial set of results from
clusterProfiler. We’re doing overlap (rather than
quantitative) statistics here. We should use the same settings for each
of the tests. It’s important that we don’t filter the hits to keep only
significant results but get all of the data for all tested
categories.
It’s a good idea to limit the size of the gene sets used for the tests. ClusterProfiler does this by default anyway.
In this instance we’ll just use the GO:BP subsection. We’re just going to use all significant genes in this test, but you can be more selective. Depending on the nature of your data it may make sense to split out upregulated and downregulated genes.
We’ll calculate results just for cluster 4 to illustrate how this works.
Now we have individual hits we can plot them out using a volcano plot. It’s often useful to make these interactive so that you can put them into Notebooks and mouse over individual hits to see what they are.
Next we want to do some categorical clustering to find hits which are similar in their gene content and then pick a suitable term to represent each of them.
cluster_enrich_result(enrich_cluster_4) -> clusterered_results_cluster4
#> Joining with `by = join_by(Description)`
clusterered_results_cluster4
#> # A tibble: 24 × 14
#> ID Description term_cluster term_cluster_size GeneRatio BgRatio RichFactor
#> <chr> <chr> <int> <int> <chr> <chr> <dbl>
#> 1 GO:0… cell-cell … 1 2 45/628 233/14… 0.193
#> 2 GO:0… modulation… 2 19 59/628 404/14… 0.146
#> 3 GO:0… regulation… 3 66 55/628 363/14… 0.152
#> 4 GO:0… synapse or… 4 6 52/628 425/14… 0.122
#> 5 GO:0… adenylate … 5 6 26/628 167/14… 0.156
#> 6 GO:0… sensory pe… 6 17 44/628 414/14… 0.106
#> 7 GO:0… neurotrans… 7 10 24/628 168/14… 0.143
#> 8 GO:0… response t… 8 7 37/628 348/14… 0.106
#> 9 GO:0… response t… 9 4 18/628 109/14… 0.165
#> 10 GO:0… potassium … 10 6 22/628 170/14… 0.129
#> # ℹ 14 more rows
#> # ℹ 7 more variables: FoldEnrichment <dbl>, zScore <dbl>, pvalue <dbl>,
#> # p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
cluster_volcano_plot(clusterered_results_cluster4, interactive = TRUE)We have several clusters, so it would be nice to show the results for all of them. This isn’t specific to the package, but shows how this could be done.
calculate_cluster_enrichment <- function(data,cluster_name) {
enrichGO(
gene = data |> filter(cluster==cluster_name) |> pull(Gene),
universe = data |> pull(Gene),
OrgDb = org.Hs.eg.db,
keyType = "SYMBOL",
pvalueCutoff = 1,
qvalueCutoff = 1,
readable = TRUE,
ont = "BP"
)
}
hit_data |> arrange(cluster)|> filter(!is.na(cluster)) |> distinct(cluster) |> pull(cluster) -> cluster_names
lapply(
cluster_names,
function(x)calculate_cluster_enrichment(hit_data,x)
) -> all_enrichment_results
names(all_enrichment_results) <- cluster_namesWe can then plot out the results for each cluster
for (name in cluster_names) {
print(volcano_plot(all_enrichment_results[[name]], interactive=TRUE, title=name))
}We can also generate clustered results
lapply(
all_enrichment_results,
cluster_enrich_result
) -> all_clustered_hits
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
#> Joining with `by = join_by(Description)`
for (i in 1:length(cluster_names)) {
if (! is.null(all_clustered_hits[[i]])) {
all_clustered_hits[[i]] |>
add_column(cluster=cluster_names[i], .before=1) -> all_clustered_hits[[i]]
print(all_clustered_hits[[i]])
}
}
#> # A tibble: 7 × 15
#> cluster ID Description term_cluster term_cluster_size GeneRatio BgRatio
#> <chr> <chr> <chr> <int> <int> <chr> <chr>
#> 1 cluster_11 GO:00… extracellu… 1 4 16/214 284/14…
#> 2 cluster_11 GO:00… blood circ… 2 15 20/214 424/14…
#> 3 cluster_11 GO:00… response t… 3 34 21/214 464/14…
#> 4 cluster_11 GO:00… C21-steroi… 4 14 4/214 16/140…
#> 5 cluster_11 GO:00… positive r… 5 2 4/214 21/140…
#> 6 cluster_11 GO:00… L-glutamat… 6 4 4/214 25/140…
#> 7 cluster_11 GO:19… cellular r… 7 2 7/214 93/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> # pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 9 × 15
#> cluster ID Description term_cluster term_cluster_size GeneRatio BgRatio
#> <chr> <chr> <chr> <int> <int> <chr> <chr>
#> 1 cluster_13 GO:00… extracellu… 1 10 37/431 284/14…
#> 2 cluster_13 GO:00… complement… 2 5 13/431 37/140…
#> 3 cluster_13 GO:00… regulation… 3 11 18/431 128/14…
#> 4 cluster_13 GO:00… skeletal s… 4 7 35/431 458/14…
#> 5 cluster_13 GO:00… positive r… 5 2 35/431 478/14…
#> 6 cluster_13 GO:00… renal syst… 6 9 25/431 287/14…
#> 7 cluster_13 GO:00… vascular e… 7 5 8/431 43/140…
#> 8 cluster_13 GO:00… oxygen tra… 8 3 4/431 10/140…
#> 9 cluster_13 GO:00… positive r… 9 2 8/431 55/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> # pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 11 × 15
#> cluster ID Description term_cluster term_cluster_size GeneRatio BgRatio
#> <chr> <chr> <chr> <int> <int> <chr> <chr>
#> 1 cluster_15 GO:0… nuclear di… 1 95 76/243 405/14…
#> 2 cluster_15 GO:0… DNA replic… 2 24 39/243 270/14…
#> 3 cluster_15 GO:0… meiotic ce… 3 17 34/243 244/14…
#> 4 cluster_15 GO:0… centromere… 4 6 11/243 28/140…
#> 5 cluster_15 GO:1… positive r… 5 14 11/243 28/140…
#> 6 cluster_15 GO:0… cell cycle… 6 21 20/243 146/14…
#> 7 cluster_15 GO:0… sister chr… 7 3 12/243 54/140…
#> 8 cluster_15 GO:0… positive r… 8 6 9/243 26/140…
#> 9 cluster_15 GO:0… centrosome… 9 8 15/243 134/14…
#> 10 cluster_15 GO:0… DNA biosyn… 10 2 14/243 181/14…
#> 11 cluster_15 GO:0… cytokineti… 11 2 7/243 42/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> # pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 3 × 15
#> cluster ID Description term_cluster term_cluster_size GeneRatio BgRatio
#> <chr> <chr> <chr> <int> <int> <chr> <chr>
#> 1 cluster_19 GO:00… roof of mo… 1 21 11/279 83/140…
#> 2 cluster_19 GO:00… negative r… 2 6 21/279 395/14…
#> 3 cluster_19 GO:00… negative r… 3 4 6/279 45/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> # pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 24 × 15
#> cluster ID Description term_cluster term_cluster_size GeneRatio BgRatio
#> <chr> <chr> <chr> <int> <int> <chr> <chr>
#> 1 cluster_4 GO:00… cell-cell … 1 2 45/628 233/14…
#> 2 cluster_4 GO:00… modulation… 2 19 59/628 404/14…
#> 3 cluster_4 GO:00… regulation… 3 66 55/628 363/14…
#> 4 cluster_4 GO:00… synapse or… 4 6 52/628 425/14…
#> 5 cluster_4 GO:00… adenylate … 5 6 26/628 167/14…
#> 6 cluster_4 GO:00… sensory pe… 6 17 44/628 414/14…
#> 7 cluster_4 GO:00… neurotrans… 7 10 24/628 168/14…
#> 8 cluster_4 GO:00… response t… 8 7 37/628 348/14…
#> 9 cluster_4 GO:00… response t… 9 4 18/628 109/14…
#> 10 cluster_4 GO:00… potassium … 10 6 22/628 170/14…
#> # ℹ 14 more rows
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> # pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>
#> # A tibble: 12 × 15
#> cluster ID Description term_cluster term_cluster_size GeneRatio BgRatio
#> <chr> <chr> <chr> <int> <int> <chr> <chr>
#> 1 cluster_8 GO:00… sterol bio… 1 22 20/462 58/140…
#> 2 cluster_8 GO:00… endoderm f… 2 6 10/462 52/140…
#> 3 cluster_8 GO:00… regulation… 3 12 9/462 46/140…
#> 4 cluster_8 GO:00… myofibril … 4 4 10/462 62/140…
#> 5 cluster_8 GO:00… axon guida… 5 18 19/462 199/14…
#> 6 cluster_8 GO:00… angiogenes… 6 14 33/462 470/14…
#> 7 cluster_8 GO:00… intermedia… 7 3 9/462 52/140…
#> 8 cluster_8 GO:00… organophos… 8 4 19/462 205/14…
#> 9 cluster_8 GO:00… organic hy… 9 9 19/462 231/14…
#> 10 cluster_8 GO:00… negative r… 11 10 7/462 42/140…
#> 11 cluster_8 GO:00… negative r… 12 2 10/462 88/140…
#> 12 cluster_8 GO:00… ovulation 13 2 4/462 14/140…
#> # ℹ 8 more variables: RichFactor <dbl>, FoldEnrichment <dbl>, zScore <dbl>,
#> # pvalue <dbl>, p.adjust <dbl>, qvalue <dbl>, geneID <chr>, Count <int>Next we want to look for terms whose enrichment changes significantly between different datasets. We can do this by passing a list of the enrichment results we want to compare. Only terms which are in all sets can be tested, which is why it’s important that the original enrichment is calculated without a pvalue filter.
*NB There is an issue with clusterProfiler where it doesn’t report categories with no overlaps in the gene set, however you set the filters. This may be fixed in the latest devel version but we’re waiting to see. At the moment these results may under-represent the true hits.
compare_enrichment_results(all_enrichment_results) -> enrichment_differences
#> Joining with `by = join_by(ID)`
enrichment_differences |>
filter(p.adjust<0.05)
#> # A tibble: 281 × 11
#> ID Description p.adjust pvalue FoldEnrichment_clust…¹
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 GO:0007059 chromosome segregation 2.53e-49 4.55e-53 0
#> 2 GO:0000280 nuclear division 6.32e-41 2.27e-44 0.649
#> 3 GO:0098813 nuclear chromosome segre… 1.53e-40 8.26e-44 0
#> 4 GO:0048285 organelle fission 4.70e-39 3.38e-42 0.581
#> 5 GO:0140014 mitotic nuclear division 3.32e-34 2.98e-37 0.738
#> 6 GO:0000819 sister chromatid segrega… 9.01e-34 9.71e-37 0
#> 7 GO:0000070 mitotic sister chromatid… 4.88e-33 6.14e-36 0
#> 8 GO:0033044 regulation of chromosome… 2.32e-22 3.33e-25 0.272
#> 9 GO:0051983 regulation of chromosome… 2.20e-20 3.55e-23 0
#> 10 GO:0044772 mitotic cell cycle phase… 2.73e-20 4.89e-23 0.485
#> # ℹ 271 more rows
#> # ℹ abbreviated name: ¹FoldEnrichment_cluster_11
#> # ℹ 6 more variables: FoldEnrichment_cluster_13 <dbl>,
#> # FoldEnrichment_cluster_15 <dbl>, FoldEnrichment_cluster_19 <dbl>,
#> # FoldEnrichment_cluster_4 <dbl>, FoldEnrichment_cluster_5 <dbl>,
#> # FoldEnrichment_cluster_8 <dbl>We can see that these are redundant because we’re using all terms to test with.
Let’s pick out only the terms which were in our clustered hits and test just those.
lapply(all_clustered_hits, function(x)x$ID) -> interesting_ids
do.call(c,interesting_ids) |> unique() -> interesting_ids
lapply(all_enrichment_results, function(x)x |> filter(ID %in% interesting_ids)) -> filtered_enrichment_results
compare_enrichment_results(filtered_enrichment_results) -> filtered_enrichment_differences
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Joining with `by = join_by(ID)`
filtered_enrichment_differences |>
filter(p.adjust<0.05)
#> # A tibble: 48 × 11
#> ID Description p.adjust pvalue FoldEnrichment_clust…¹
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 GO:0000280 nuclear division 1.43e-42 2.27e-44 0.649
#> 2 GO:0006260 DNA replication 3.17e-21 1.01e-22 0.486
#> 3 GO:0051321 meiotic cell cycle 3.42e-16 1.63e-17 0.538
#> 4 GO:0050804 modulation of chemical s… 7.71e- 8 5.45e- 9 1.14
#> 5 GO:0007062 sister chromatid cohesion 7.71e- 8 6.12e- 9 1.22
#> 6 GO:0044839 cell cycle G2/M phase tr… 1.01e- 7 9.66e- 9 0.450
#> 7 GO:0007098 centrosome cycle 3.34e- 7 3.71e- 8 0
#> 8 GO:1905820 positive regulation of c… 7.87e- 7 1.00e- 7 0
#> 9 GO:0034508 centromere complex assem… 1.11e- 6 1.58e- 7 0
#> 10 GO:0098742 cell-cell adhesion via p… 1.92e- 6 3.05e- 7 1.41
#> # ℹ 38 more rows
#> # ℹ abbreviated name: ¹FoldEnrichment_cluster_11
#> # ℹ 6 more variables: FoldEnrichment_cluster_13 <dbl>,
#> # FoldEnrichment_cluster_15 <dbl>, FoldEnrichment_cluster_19 <dbl>,
#> # FoldEnrichment_cluster_4 <dbl>, FoldEnrichment_cluster_5 <dbl>,
#> # FoldEnrichment_cluster_8 <dbl>Now we’re down to just 48 terms which is a lot more manageable.